Abstract

In this paper we investigate statistical model compression
applied to natural language understanding (NLU) models.
Small-footprint NLU models are important for enabling offline
systems on hardware restricted devices, and for decreasing
on-demand model loading latency in cloud-based systems. To
compress NLU models, we present two main techniques, parameter
quantization and perfect feature hashing. These techniques are
complementary to existing model pruning strategies such as L1
regularization. We performed experiments on a large scale NLU
system. The results show that our approach achieves a 14-fold
reduction in memory usage compared to the original models with
minimal predictive performance impact.

Index Terms: natural language understanding, model compression

1. Introduction 

Voice-assistants with natural language understanding (NLU)
[1], such as Amazon Alexa, Apple Siri, Google Assistant, and
Microsoft Cortana, are increasing in popularity. However, with
their popularity comes a growing demand to support
availability in many contexts and a wide range of functionality.

To support Alexa in contexts with no internet connection,
Panasonic and Amazon announced a partnership to bring
offline voice control services to car navigation commands,
temperature control, and music playback [2]. These services are
"offline" because the local system running the voice-assistant
may not have internet access. Thus, instead of sending the
user's request for cloud-based processing, everything including
NLU has to be performed locally on a hardware restricted
device. However, cloud-based NLU models have large memory
footprints which make them unsuitable for local system
deployment without appropriate compression.

Furthermore, to support a wide range of functionality,
Amazon Alexa and Google Assistant support skills built by external
developers. Each skill has NLU models that extend the
functionality of the main NLU models. Since there are many skills,
their NLU models are loaded on demand only when needed to
process a user request [3]. If the skill NLU model sizes are large,
loading them into memory adds significant latency to utterance
recognition. Thus, small-footprint NLU models are important
for providing quick NLU response and good customer
experience.

Typically NLU models consist of domain classification
(DC), intent classification (IC) and named-entity recognition
(NER) models. DC predicts the general domain class of a
user utterance such as Music, Shopping, and Cinema. IC
predicts the user intent within a domain such as PlayMusicIntent,
BuyItemIntent, or MovieShowTimesIntent. NER recognizes
domain-specific named entities such as artist name and
song name for the Music domain, and item name and product
type for the Shopping domain.

*The authors have equal contribution to this work. The names are
in alphabetical order.

In this paper, we investigate statistical model compression
for NLU DC, IC, and NER models. We use n-gram maximum
entropy (MaxEnt) [4] models for DC and IC, and n-gram
conditional random fields (CRF) [5] models for NER, but this work
can be extended to any type of model with a large number of
features. We aim to reduce the memory footprint of large scale
MaxEnt and CRF models to enable local voice-assistants and
decrease the latency of loading skill NLU models in the cloud. We
present two main techniques, parameter quantization and perfect
hashing. We demonstrate these techniques' effectiveness with both
empirical and theoretical justification. We also detail the
tradeoffs of time, space, and predictive performance.

2. Background and Related Work 

Various methods have been proposed to reduce the memory and
CPU footprint of machine learning models for image
classification [6, 7], keyword spotting [8, 9], language models
[10, 11], acoustic models [12, 13], and text classification
[14, 15]. These methods fall into three classes. (i)
Pre-processing methods - these include classic dimensionality
reduction techniques like principal component analysis, feature
hashing [16], and random projection [17, 18] as well as deep
autoencoders [19] and sparse autoencoders [20]. (ii) Learning
algorithm methods - this is where the learning algorithm itself
is programmed to produce a small model. Examples include
L1-regularization, greedy stepwise feature selection, boosting of
small-simplified models, and synaptic-pruning [21]. (iii)
Post-processing methods - these include methods such as
parameter quantization [22] and data representation
optimizations [7].

Commonly, pre-processing and learning algorithm methods
are already incorporated into the cloud model building process
for a voice-assistant, so in this work, our efforts are primarily
directed towards the post-processing methods. Parameter
quantization has been shown to be effective for reducing memory
footprint for both traditional and deep models. However, as far
as we know, perfect hashing has not been applied to MaxEnt and
CRF compression, but only to language models ED. 

3. Model Compression Approach 

3.1. Objective 

Our primary objective is to design algorithms which take
large statistical NLU models and produce models which are
equally predictive but have a smaller memory footprint. This
post-processing compression allows for reusing existing model
building configurations and pipelines without maintaining
separate ones for small-footprint models.



We evaluate the statistical model size reduction along three
dimensions: time, space, and predictive performance. Time
refers to the computational runtime complexity to perform a
prediction. Space is measured as the number of bits required
to store the model in memory. We use the term predictive
performance to refer to the evaluation metric of choice such as F1
score and accuracy. Challenges arise in balancing the tradeoffs
across these three dimensions, as improving on one often
costs on the other two. For example, improving model
predictive performance may require larger models and slower
decoding time, while an effort to reduce decoding time may degrade
predictive performance and increase the memory needed. Thus,
it is important to find the best tradeoff for any given application.

3.2. Our Techniques 

We propose two techniques to perform statistical model
compression: quantization and perfect hashing. Individually, these
approaches yield moderate model size reduction, but we
combine them to achieve significant compression rates with minimal
time and predictive performance tradeoffs. Before detailing the
algorithms, we briefly discuss a generalized model
structure with accompanying notation.

A machine learning model's memory footprint can be
viewed principally as a large map from feature name to numeric
weight. In NLU, there is typically an initially large universe $U$
of potentially active or relevant features (such as all English
bigrams). Of those features, a subset $S$, whose cardinality can be
much smaller than that of $U$, contains the relevant parameters
chosen by the learning algorithm using feature selection methods.
The relevant features and their corresponding weights are stored
in the map while the irrelevant parameters have 0 weight or are
simply excluded from the model. For convenience we denote
$n = |S|$ and assume all 0 weight parameters are excluded from
$S$.

At runtime, to use the model to evaluate an instance, a set
$S' \subseteq U$ of parameters will be accessed. For MaxEnt and
CRF, this instance parameter set will be small, $|S'| \ll |S|$.
Thus, for each instance only a relatively small number of
parameters will be required to make a prediction. For example, if
$U$ is all English bigrams, $S$ would be a smaller set of those
bigrams which are useful features for the prediction task, and
$S'$ would be those bigrams present in a single utterance. Hence
$S \cap S'$ are those parameters and weights that need to be
accessed to predict on that instance.

If the parameter map is implemented as a hash map, the
model memory footprint in total bits will be $O(n \cdot (s + w))$
where $n$ is the total number of parameters of the model, and $s$
and $w$ are the sizes of the parameter name and weight
respectively in bits. The expected lookup time for a parameter is
then $O(s + w)$, which is constant for bounded $s$ and $w$. Our
goal is to reduce this footprint while maintaining the lookup cost
with little to no predictive performance degradation.

3.3. Quantization 

Our initial step to model compression is model parameter
quantization. To apply parameter quantization, we first choose a
set of representative cluster centers and then assign each
parameter to its nearest cluster. When a parameter weight is
accessed, its representative value is used in place of the original
value during the computation. The advantage from a data
storage perspective is that we now need only store the cluster
identifier at each entry in our map instead of the full weight.
Each weight in the map is replaced by its cluster index, which
requires only $O(\log k)$ bits where $k$ is the number of clusters
chosen. Additionally, we must now store a small table
mapping each index to its representative cluster center. To
predict a new instance, we look up the quantized index for each
feature of $S \cap S'$ and then determine its quantized weight
from the small table.

In terms of space, our parameter name to quantized index
map is now reduced to a size of $O(n(s + \log k))$ while our small
table is of size $O(wk)$, for a total size of
$O(n(s + \log k) + wk)$. In terms of runtime speed, the expected
lookup time remains $O(s + w)$ with an additional cache miss due
to the use of a second table. Using 256 bins requires only 8 bits
per entry to store the weights, as opposed to the 64 bits required
for double precision or 32 bits for single precision.

For the predictive performance tradeoff, the two
hyperparameters are the number of centroids $k$ and the method for
choosing the centers. Choosing the cluster centers for
quantization can be a challenging task and many methods
have been proposed. In traditional linear quantization the
clusters are chosen by evenly partitioning the range between the
minimum and maximum weight values. We find that for our purpose,
linear quantization yields adequate predictive performance.
The reason is that it rounds many small parameter values to
zero, and preserves the larger weights that affect predictive
performance. If the cluster centers were instead designed to
follow the distribution of parameter values (a peaky distribution
around zero), this rounding effect would be smaller and the
larger, more important weights would have less precision.
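
As a minimal sketch of the linear quantization described above, the following Python snippet places $k$ evenly spaced centers between the minimum and maximum weight and keeps only the index of the nearest center for each parameter. The names `linear_quantize` and `dequantize` are illustrative and are not taken from the system described here; the tie-breaking rule and the requirement $k \geq 2$ are assumptions of the sketch.

```python
def linear_quantize(weights, k=256):
    """Map each weight to the index of the nearest of k evenly spaced centers.

    Illustrative sketch (assumes k >= 2): the centers evenly partition
    the range [min(weights), max(weights)].
    """
    lo, hi = min(weights), max(weights)
    # Guard against a constant weight vector, where the range is empty.
    step = (hi - lo) / (k - 1) if hi > lo else 1.0
    # Small table of representative values: O(w * k) bits.
    centers = [lo + i * step for i in range(k)]
    # One cluster index per parameter: O(log k) bits each.
    indices = [min(k - 1, round((w - lo) / step)) for w in weights]
    return indices, centers


def dequantize(indices, centers):
    """Recover the representative weight for every stored index."""
    return [centers[i] for i in indices]


weights = [-1.0, -0.5, 0.0, 0.26, 1.0]
indices, centers = linear_quantize(weights, k=5)
approx = dequantize(indices, centers)
```

With this scheme the error introduced for any single weight is at most half the spacing between two adjacent centers.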

3.4. Perfect Hashing 

Examining the total memory cost after quantization, we find
that the $O(ns)$ term dominates the memory footprint. However,
at runtime, we can replace the full set $S$ of feature names using
an elegant application of perfect hashing.

A perfect hash function maps our set $S$ of $n$ keys into $m$
buckets with no collisions, and better yet a minimal perfect hash
function (MPHF) hashes our set $S$ of $n$ keys into $n$ buckets
with no collisions. If we had an MPHF, then we would just need to
store an array of quantized indices and at runtime use the MPHF
to index to the values of the parameters required for that
instance. The challenge is to find a hash function that achieves
no collisions, is quick to evaluate, and requires little storage
space. Here we describe a method which achieves $O(n)$ expected
space and $O(n)$ expected construction runtime, which is a
variation of the method given in [23].

Before describing the algorithm, we assume that we have
access to a universal hash family from which we can draw
pseudo-random hash functions, i.e. there is a set of hash
functions $h_0, h_1, h_2, \ldots$ where $h_i$ is the hash function
with seed $i$ and each $h_i$ meets the simple uniform hashing
assumption (SUHA). SUHA states that each element hashed has an
equal chance of being hashed to each bucket, meaning that
$\Pr[h_i(x) \bmod m = j] = 1/m$ for all choices of $i$, $x$, and
$j \in [0, m - 1]$. We also assume computing $h_i(x)$ is linear in
the size of $x$. In practice, we implement this with a seeded
MurmurHash [24]. With these assumptions in place, in
Algorithm 1 we outline the procedure for constructing a minimal
perfect hash function from a set of keys.

Algorithm 1 Minimal Perfect Hash Function Construction

Set $S_1 = S$
for $i = 1, 2, 3, \ldots$ do
    Choose a hash function $h_i$.
    Initialize a bit array $B_i$ of size $|S_i|$ to zeros.
    Hash all $x \in S_i$ to range $[0, |S_i| - 1]$.
    if a single entry of $S_i$ gets hashed by $h_i$ to position $j$ then
        Set $B_i[j] = 1$.
    end if
    Set $S_{i+1} = \{x \in S_i$ where $x$ hashes to a position $j$ with $B_i[j] \neq 1\}$.
    if $|S_{i+1}| = 0$ then
        break
    end if
end for

Note that in Algorithm 1 a single unique 1 bit is set for
each element of $S$ across $B_1, B_2, \ldots$. To find the hash of
an element $x$ at runtime, we hash level by level until we hash to
a 1. We then need to associate that 1 with a unique index in the
range $[0, n - 1]$. This is done by viewing the bit arrays as one
giant bit array $B = B_1 \oplus B_2 \oplus B_3 \oplus \cdots$,
concatenating the arrays together, and then assigning each 1 its
rank, i.e. the number of 1's preceding it in $B$. This defines our
minimal perfect hash function, which we denote as $h^*$.
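
The level-by-level construction and lookup described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used in this work: BLAKE2b with a seed-derived salt stands in for a seeded MurmurHash, the levels are stored 0-indexed, the rank is computed with a naive linear scan, and the names `seeded_hash`, `build_mphf`, and `mphf_lookup` are hypothetical.

```python
import hashlib


def seeded_hash(key, seed):
    """Seeded 64-bit hash; BLAKE2b is a stand-in for a seeded MurmurHash."""
    digest = hashlib.blake2b(
        key.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def build_mphf(keys):
    """Build the per-level bit arrays (level i uses the hash with seed i)."""
    remaining = list(dict.fromkeys(keys))  # the keys must be distinct
    levels = []
    while remaining:
        seed, m = len(levels), len(remaining)
        counts = [0] * m
        for x in remaining:
            counts[seeded_hash(x, seed) % m] += 1
        # A bit is set only where exactly one key landed.
        bits = [1 if c == 1 else 0 for c in counts]
        levels.append(bits)
        # Keys that collided are carried over to the next level.
        remaining = [x for x in remaining if not bits[seeded_hash(x, seed) % m]]
    return levels


def mphf_lookup(levels, key):
    """Return the rank of the first 1 bit the key hashes to, or None."""
    ones_before = 0
    for seed, bits in enumerate(levels):
        pos = seeded_hash(key, seed) % len(bits)
        if bits[pos]:
            return ones_before + sum(bits[:pos])  # naive linear-time rank
        ones_before += sum(bits)
    return None  # reached a 0 at the bottom level: the key is not in S


keys = [f"bigram_{i}" for i in range(1000)]
levels = build_mphf(keys)
slots = sorted(mphf_lookup(levels, k) for k in keys)
```

Every key in the set receives a distinct slot in $[0, n - 1]$, so `slots` enumerates that range exactly once. A key outside the set may still hit a 1 bit and receive the slot of some other key.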

Concerning evaluation time, computing the rank can be
done efficiently by using what are known as succinct data
structures [23, 25, 26]. We break $B$ into chunks and store
preaggregated rank sums computed at the chunk level. To find the
rank of an index, we first look up the preaggregated count of 1's
for the chunk containing the queried index. Then we linearly scan
the containing chunk to compute the rank of the index inside the
chunk. We return these two numbers added together. We can
theoretically achieve constant time rank computation with a linear
number of extra bits by having a multilevel chunking scheme
(chunks within chunks). In practice, a simple one level scheme
works well, especially since using bitwise operations during the
linear scan of a chunk is particularly CPU cache efficient.
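
The one level chunking scheme can be illustrated with a short Python sketch. The class name `RankIndex`, the chunk size, and the use of a Python list of 0/1 values instead of a packed bit array are assumptions made for readability.

```python
class RankIndex:
    """One level rank structure over a bit array (illustrative sketch)."""

    def __init__(self, bits, chunk_size=64):
        self.bits = bits
        self.chunk_size = chunk_size
        # prefix[c] holds the number of 1's in all chunks before chunk c.
        self.prefix = []
        total = 0
        for start in range(0, len(bits), chunk_size):
            self.prefix.append(total)
            total += sum(bits[start:start + chunk_size])

    def rank(self, pos):
        """Number of 1's strictly before position pos (0 <= pos < len(bits))."""
        chunk = pos // self.chunk_size
        start = chunk * self.chunk_size
        # Preaggregated count plus a linear scan inside the containing chunk.
        return self.prefix[chunk] + sum(self.bits[start:pos])


bits = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
index = RankIndex(bits, chunk_size=4)
rank_of_last = index.rank(9)  # 1's before position 9
```

A production version would pack the bits into machine words and replace the inner scan with population count instructions; the lookup logic stays the same.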

We now discuss how our minimal perfect hashing algorithm
affects predictive performance. For every element
$x \in S' \cap S$, the algorithm will return the correct index of
that parameter weight. However, the lookup algorithm for
$x \in S' \setminus S$ will either reach a 0 at the bottom level
bit array, in which case we know for certain the feature has a 0
weight associated with it, or otherwise $x$ will collide with
another arbitrary parameter. This behavior is undesirable, and
unless we explicitly store each key, which we are trying to avoid,
it is impossible to guarantee no false-positives. However, we can
reduce the false-positive rate at the cost of a few extra bits per
entry. The idea is to store an extra array $F$ of entry
"fingerprints". Given a hash function $f$, we store
$F[h^*(x)] = f(x) \bmod (1/\epsilon)$ where $\epsilon$ is the
desired false-positive rate. Each fingerprint is
$\log(1/\epsilon)$ bits in length and by SUHA the likelihood that
two entries have matching fingerprints is $\epsilon$. Hence to
look up a weight we take the extra step of checking that its
fingerprint matches. If the fingerprint does not match, we can
report a weight of 0 and that the parameter is not present in the
model with 100% certainty. Otherwise, we report the weight hashed
to with $1 - \epsilon$ confidence.
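
The fingerprint check can be sketched as follows. To keep the example self-contained, a plain dictionary plus a fallback hash stands in for $h^*$: known keys get their own slot and unknown keys are sent to an arbitrary slot, which mimics how a minimal perfect hash function treats keys outside $S$. That stand-in, the BLAKE2b hash, and the names `FingerprintedWeights` and `_hash64` are assumptions for illustration only; a real deployment would not store the keys.

```python
import hashlib


def _hash64(key, salt):
    """64-bit hash of a string; BLAKE2b is used here purely for illustration."""
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


class FingerprintedWeights:
    """Weights addressed by slot and guarded by a fingerprint array F."""

    def __init__(self, weights, epsilon=1e-4):
        self.modulus = round(1 / epsilon)  # fingerprints are f(x) mod (1/epsilon)
        # Stand-in for h*: a real MPHF would not keep the keys around.
        self._slot = {k: i for i, k in enumerate(weights)}
        self.W = list(weights.values())
        self.F = [_hash64(k, b"fp") % self.modulus for k in weights]

    def _h_star(self, key):
        slot = self._slot.get(key)
        if slot is not None:
            return slot
        # Unknown keys collide with an arbitrary existing parameter.
        return _hash64(key, b"slot") % len(self.W)

    def get(self, key):
        i = self._h_star(key)
        if self.F[i] != _hash64(key, b"fp") % self.modulus:
            return 0.0      # fingerprint mismatch: certainly not in the model
        return self.W[i]    # a false-positive remains possible at rate epsilon


model = FingerprintedWeights({f"bigram_{i}": float(i + 1) for i in range(100)})
false_hits = sum(model.get(f"unseen_{i}") != 0.0 for i in range(5000))
```

With $\epsilon = 0.0001$ one would expect well under one false hit among the 5,000 unseen keys queried here, while every stored key returns its own weight.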

From a space perspective, storing $h^*$ is small relative to
the original $O(sn)$ cost of storing each of the keys. It can be
shown that $h^*$ will have size $O(n)$ bits with high probability,
but in order for the MPHF algorithm to be effective for our
compression application, the hidden constant factors need
to be small. For the statistical models we deal with, which
commonly have $n > 500{,}000$ parameters, the size of $h^*$ is
less than $3.4 + \log(1/\epsilon)$ bits per entry.
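
To make the per-entry cost concrete, the following back-of-the-envelope estimate assumes base-2 logarithms, a model with $n = 500{,}000$ parameters, and, for illustration, the hyperparameters $k = 256$ and $\epsilon = 0.0001$ reported in Section 4.1. It also assumes that the quantized index is stored next to each entry and ignores the small table of cluster centers. It is a rough bound under these assumptions, not a measured figure.

```latex
\begin{align*}
\text{bits for } h^* \text{ and fingerprint}
  &< 3.4 + \log_2(1/\epsilon) = 3.4 + \log_2(10^4) \approx 3.4 + 13.3 = 16.7 \\
\text{bits for the quantized index}
  &= \log_2 k = \log_2 256 = 8 \\
\text{total per entry}
  &\approx 16.7 + 8 = 24.7 \text{ bits} \approx 3.1 \text{ bytes} \\
\text{total for } n = 500{,}000
  &\approx 500{,}000 \times 24.7 \text{ bits} \approx 1.54 \text{ MB}
\end{align*}
```

Under the same assumptions, dropping the fingerprints ($\epsilon = 1$) leaves about $3.4 + 8 = 11.4$ bits per entry.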

We have addressed the predictive performance and space
tradeoffs of applying the perfect hashing technique, so we last
discuss the evaluation time tradeoff. We pay little extra in
lookup time to use the perfect hashing algorithm. The expected
evaluation time of $h^*(x)$ for $x \in U$ is
$O(s + \log(1/\epsilon))$ with a constant number of expected cache
misses. Table 1 below summarizes the tradeoffs discussed when
applying the two techniques, quantization and perfect hashing,
for compressing the models.


Table 1: Summary of compressed models tradeoff

                                Normal          Compressed
Parameter Access Time           $O(s + w)$      $O(s + w + \log(1/\epsilon))$
Total Space                     $O(n(s + w))$   $O(n \log(k/\epsilon) + wk)$
Predictive Performance Impact   baseline        granularity loss, $\epsilon$ false-positive rate


4. Experimental Results 

In this section we present experimental results using our
compression techniques. We apply them to large scale cloud NLU
models from six domains suitable for local voice-assistants,
such as temperature control and navigation. All of these models
have feature counts over 500,000 and many have several million
or more. We first discuss the model size reduction and then the
effect on the predictive performance.

Note that for skill NLU models the results are similar and
not provided here.

4.1. Compression 

The compression results are given in Table 2. After applying
the model size reduction techniques, we see significant
compression rates compared to the normal statistical models. We
used hyperparameters of $k = 256 \Rightarrow \log k = 1$ byte and
false-positive rate $\epsilon = 0.0001$. We experimented with
varying $k$ but found $k = 256$ a desirable choice because it gave
us adequate predictive performance and is programmatically
convenient since each cluster index can be stored in a whole byte.
We achieve a significant 14.25-fold memory footprint reduction
(567.2 MB compared to 39.8 MB) and for some models have a
compression ratio as high as 31.5 (Domain 1 IC).

The MaxEnt DC total compression rate is lower than that of the
MaxEnt IC models, 12.3-fold vs. 24.7-fold. The reason is that
for the normal DC we use the feature hashing trick during
training. Thus, instead of storing a map from string feature name
to feature id, we store a 32-bit integer hash to feature id map.
Since our DC models have millions of parameters, this integer
map takes significant memory, and using our approach reduces
the memory requirement for each key from a 32-bit value to less
than $3.4 + \log(1/\epsilon)$ bits.

The CRF NER total compression rate is around 10.8-fold, 
which is lower than MaxEnt. The reason is that CRF models 
have greater structural complexity than MaxEnt, and we need 
to maintain additional information on state transitions and state 
observation that we do not quantize and hash. 

Without fingerprinting ($\epsilon = 1$), we obtain a 25.3-fold
memory footprint reduction (567.2 MB compared to 22.4 MB). This
is around 43% smaller than with $\epsilon = 0.0001$. From Table 2
we note that for IC and DC models the fingerprints consume
more than half of the memory. However, without fingerprinting
the false-positives from parameter access affect predictive
performance, which is described in the next section.
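
The fold reductions quoted in this section can be re-derived from the "Total" row of Table 2 with a few lines of Python. The sketch below only repeats that arithmetic; because the table values are rounded to one decimal place, a recomputed ratio can differ from the quoted figure in the last digit. The names `totals_mb` and `fold_reduction` are illustrative.

```python
# Totals in MB from the "Total" row of Table 2:
# (normal, compressed with epsilon = 0.0001, compressed* with epsilon = 1).
totals_mb = {
    "DC": (153.7, 12.4, 5.2),
    "IC": (208.0, 8.4, 2.9),
    "NER": (205.5, 19.0, 14.3),
    "All": (567.2, 39.8, 22.4),
}


def fold_reduction(normal, compressed):
    """How many times smaller the compressed model is than the normal one."""
    return normal / compressed


ratios = {
    name: (fold_reduction(normal, comp), fold_reduction(normal, comp_star))
    for name, (normal, comp, comp_star) in totals_mb.items()
}

# Fraction of the compressed footprint that disappears without fingerprints.
_, comp_all, comp_star_all = totals_mb["All"]
fingerprint_share = 1 - comp_star_all / comp_all

for name, (with_fp, without_fp) in ratios.items():
    print(f"{name}: {with_fp:.2f}-fold with fingerprints, "
          f"{without_fp:.2f}-fold without")
print(f"Size removed by dropping fingerprints: {fingerprint_share:.1%}")
```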















Table 2: NLU statistical model sizes in megabytes. Normal vs. compressed (Comp., $\epsilon = 0.0001$) vs. compressed* (Comp.*, $\epsilon = 1$)

          |          DC           |          IC           |          NER          |          All
          | Normal  Comp.  Comp.* | Normal  Comp.  Comp.* | Normal  Comp.  Comp.* | Normal  Comp.  Comp.*
Domain 1  |   27.2    2.2     0.9 |   42.0    1.3     0.4 |   14.0    1.5     1.2 |   83.2    5.0     2.5
Domain 2  |   25.8    2.1     0.9 |   65.4    1.4     0.5 |   86.9    7.6     5.5 |  178.1   11.1     6.9
Domain 3  |   13.1    1.1     0.4 |    0.6    0.1   0.002 |    1.0    0.2     0.1 |   14.7    1.4     0.5
Domain 4  |   10.9    0.9     0.4 |    9.5    0.6     0.2 |    2.5    0.4     0.3 |   22.9    1.9     0.9
Domain 5  |   25.3    2.0     0.9 |   25.7    1.3     0.5 |   84.7    7.4     5.6 |  135.7   10.7     7.0
Domain 6  |   51.5    4.1     1.7 |   64.8    3.7     1.3 |   16.4    1.9     1.6 |  132.6    9.7     4.6
Total     |  153.7   12.4     5.2 |  208.0    8.4     2.9 |  205.5   19.0    14.3 |  567.2   39.8    22.4


4.2. Predictive Performance 

We evaluate model performance on two large test datasets with
hundreds of thousands of annotated utterances:

• Supported Domains (SD) Test set: Contains utterances 
from the six local supported domains. 

• Out of Domain (OOD) Test set: This includes utterances 
that do not map to any intent or background noise. 

We use the following evaluation metrics: 

• Slot Error Rate (SER) [27] is a slot level metric that
evaluates the overall predictive performance of the models. SER
is defined as the ratio of the number of slot prediction errors
to the total number of reference slots. Errors can be
insertions, substitutions and deletions. Intent misrecognitions
are considered substitutions.

• Intent Classification Error Rate (ICER) is an utterance level
metric. ICER is defined as the ratio of the number of intent
misclassifications to the total number of utterances.

• F-ICER is a balanced ICER metric that considers both
precision and recall. We compute it as 1 - F1 score.

• Rejection Rate is defined as the percentage of utterances
with scores below a set threshold. Below threshold
utterances are rejected by the system.

An NLU system ideally has a low SER and ICER/F-ICER on 
the SD test set indicating good model predictive performance 
and a high rejection rate on the OOD test set indicating that out 
of domain utterances are rejected. 
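
The metric definitions above translate directly into code. The following Python sketch is one plausible reading of them: in particular, it computes F-ICER for a single intent label, which is an assumption, since the averaging over intents is not specified here. All function names are illustrative.

```python
def slot_error_rate(insertions, substitutions, deletions, reference_slots):
    """SER: slot prediction errors divided by the number of reference slots."""
    return (insertions + substitutions + deletions) / reference_slots


def intent_error_rate(reference, predicted):
    """ICER: fraction of utterances whose intent is misclassified."""
    errors = sum(r != p for r, p in zip(reference, predicted))
    return errors / len(reference)


def f_icer(reference, predicted, intent):
    """F-ICER for one intent label, computed as 1 - F1 (per-label by assumption)."""
    pairs = list(zip(reference, predicted))
    tp = sum(r == intent and p == intent for r, p in pairs)
    fp = sum(r != intent and p == intent for r, p in pairs)
    fn = sum(r == intent and p != intent for r, p in pairs)
    if tp == 0:
        return 1.0  # no true positives, so F1 is 0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 1.0 - 2 * precision * recall / (precision + recall)


def rejection_rate(scores, threshold):
    """Percentage of utterances whose score falls below the threshold."""
    return 100.0 * sum(s < threshold for s in scores) / len(scores)


reference = ["PlayMusicIntent", "PlayMusicIntent", "BuyItemIntent", "BuyItemIntent"]
predicted = ["PlayMusicIntent", "BuyItemIntent", "BuyItemIntent", "BuyItemIntent"]
icer = intent_error_rate(reference, predicted)
buy_f_icer = f_icer(reference, predicted, "BuyItemIntent")
```

In this toy example one of four utterances is misclassified, and the BuyItemIntent label has perfect recall but a precision of 2/3.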

Table 3 details the overall performance measures and Table 4
details the per domain performance measures. The results are
relative percentage changes compared to the normal models, as we
are unable to disclose absolute numbers.

As the results show, our compressed models perform
almost as well as our baseline models with acceptable
overall relative error increases of +0.86% in SER and +0.26% in
ICER. The per domain compressed results show small F-ICER
increases of less than +1%, except Domain 3 with +1.58%. The
reason is that the Domain 3 IC model has a small number of
prominent features, so false positives on important features are
more common.

The ultra compressed models without fingerprinting
($\epsilon = 1$) have overall relative error increases of +2.20%
in SER and +3.14% in ICER. The per domain relative error increases
are around +2% to +3% F-ICER for most domains. The error rates
of the ultra compressed models are higher than those of the
compressed models with fingerprinting. However, for a relative
error increase of around +1% to +2%, we obtain a total 25.3-fold
memory reduction compared to 14.25-fold. Thus, compression without
fingerprinting could be a viable option depending on the
memory constraints and predictive performance requirements.


Table 3: Overall predictive performance measures for NLU
statistical models.

               SD Dataset          OOD Dataset
Model          SER       ICER      Rejection Rate
Compressed     +0.86%    +0.26%    -0.08%
Compressed*    +2.20%    +3.14%    -0.72%


Table 4: Domain predictive performance measures for NLU
statistical models on the SD test set.

Domain      Compressed F-ICER    Compressed* F-ICER
Domain 1    +0.01%               +2.33%
Domain 2    +0.74%               +2.36%
Domain 3    +1.58%               +8.94%
Domain 4    +0.01%               +1.75%
Domain 5    +0.32%               +1.59%
Domain 6    +0.20%               +3.15%


Note that using no fingerprinting performs adequately because
a false-positive does not mean that an adversarial index is
hashed and the system is guaranteed to make an incorrect
prediction. Rather, when a false-positive is realized, we are
hashing to a random existing model parameter, each with equal
probability. Since the majority of our parameters are close to
zero, the false-positives add only a small amount of noise to the
predictions.

5. Conclusion 

In this paper we presented approaches to reduce the
memory footprint of NLU statistical models to work on
resource-constrained embedded systems, and to decrease the latency
of loading skill NLU models. We demonstrated the effectiveness of
our techniques in reducing memory footprint while addressing the
tradeoffs of time, space, and predictive performance. We
observed that the methods sacrifice minimally in terms of model
evaluation time and predictive performance for the substantial
compression gains observed. It would be interesting to go beyond
the results of Section 3.3 to see if there is a better
quantization scheme for our models.